HIGHER EDUCATION EXPLORATORY DATA ANALYSIS AND VISUALIZATION

============

INTRODUCTION

============

============ QUESTIONS ============

The questions that I’m trying to answer with this project.

  1. What are the trends in higher-educational enrollment?
  2. Are the trends regional or discipline specific?
  3. Where should higher-education content firms invest in salesforce and content development?

============ DATA SOURCES ============

I used data from the Integrated Postsecondary Education Data System (IPEDS), which is available at https://nces.ed.gov/ipeds/use-the-data. IPEDS has data from a variety of surveys but I used the following three surveys, accessible from the drop-down menu on the website.

Fall enrollment (EFFY). This survey reports the number of enrollments in the fall semester for each responding school, college, university, etc. We use this for overall trends and also for regional trends.

Completions. This reports the number of degree completions for each institution by department (math, biology, English, etc.) and also by subdiscipline (so for mathematics the data splits into applied math, pure math, math-econ, etc.). We use this to detect trends at the level of academic discipline.

Institutional characteristics - directory information (HD). This records the region for each school. We use this to detect regional trends.

============ CONCLUSIONS ============

The answers (see below for the data analysis supporting these):

  1. Overall enrollment is down from 2013-2018. If you divide the enrollment data by gender, men’s enrollment has declined much more sharply than women’s enrollment. The data shows a slight recovery in enrollments happening around 2018, but the missing 2019 data might not suppor this, and there is the large issue of the pandemic affecting enrollment starting in 2020 

Somewhat surprisingly, there is still growth in degree completions for many disciplines, in spite of the overall downward trend in enrollment. I’m not sure how to account for this, but my guess is that in any year a large number of students enroll in college who will eventually drop out. It seems possible that downward trend in enrollment trends could eat into this group of students first: they might have less money for college, be on the fence about whether to go to college versus enter the workforce, have less preparation for college, etc. Any factor that would later contribute to a student dropping out of college might plausibly lead to them not enrolling in the first place, particularly during a time of economic uncertainty (or even during a time of economic growth, when non-degree-specific jobs are available). If this group is large enough, they could account for diminishing enrollments while other students, not as apt to drop out of college after enrolling, drive a smaller upward trend in degree completions for many academic subdisciplines.

An alternate explanation is that some universities have lowered their academic standards in response to diminishing enrollments, in a bid to lose fewer students before degree completion. With the given data I am uncertain of how to choose which of these (or both, or neither) is the correct explanation.

  1. The trends are apparently region and discipline-specific. Regionally the Far West region has seen growth and the Great Lakes region has seen decrease in enrollments, just to name two examples. At the disciplinary level, there has been dramatic decline in History degree completions accompanied by a rise in completions of technical and Computer Science degrees.

  2. Regionally, the Southeast and Far West are among the fastest growing areas. Any regional salesforce should be directed there. At the disciplinary level, Computer Science and Mathematics (including Statistics) are both growing very quickly. Within the Mathematics degree completion data we see a trend toward applied mathematics and statistics, both areas where content creation should be directed. There are several other disciplines that have seen strong growth, particularly scientific technician degrees.

============ DATA ANALYSIS ============

Here we load a few standard tools for working with dataframes and visualizations in R.

library(dplyr)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df     

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ stringr 1.4.0
✔ tidyr   1.0.2     ✔ forcats 0.4.0
✔ readr   1.3.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(reshape2)

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

First, we want to load the data (stored on the disk as .csv files) into some data frames. We grab the EFFY and Completions surveys for each of the years 2010-2018. We only need one instance of the HD survey because this survey only contains regional characteristics.

df_effy_18 <- read_csv('effy2018.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_17 <- read_csv('effy2017.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_16 <- read_csv('effy2016.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_15 <- read_csv('effy2015.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_14 <- read_csv('effy2014.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_13 <- read_csv('effy2013.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_12 <- read_csv('effy2012.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_11 <- read_csv('effy2011.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYTOTLT = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEFYAIAT = col_character(),
  XEFYAIAM = col_character(),
  XEFYAIAW = col_character(),
  XEFYASIT = col_character(),
  XEFYASIM = col_character(),
  XEFYASIW = col_character(),
  XEFYBKAT = col_character(),
  XEFYBKAM = col_character(),
  XEFYBKAW = col_character(),
  XEFYHIST = col_character(),
  XEFYHISM = col_character(),
  XEFYHISW = col_character(),
  XEFYNHPT = col_character(),
  XEFYNHPM = col_character(),
  XEFYNHPW = col_character(),
  XEFYWHIT = col_character(),
  XEFYWHIM = col_character()
  # ... with 10 more columns
)
See spec(...) for full column specifications.
df_effy_10 <- read_csv('effy2010.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  XEYNRALM = col_character(),
  XEYNRALW = col_character(),
  XFYRAC03 = col_character(),
  XFYRAC04 = col_character(),
  XFYRAC05 = col_character(),
  XFYRAC06 = col_character(),
  XFYRAC07 = col_character(),
  XFYRAC08 = col_character(),
  XFYRAC09 = col_character(),
  XFYRAC10 = col_character(),
  XFYRAC11 = col_character(),
  XFYRAC12 = col_character(),
  XEYUNKNM = col_character(),
  XEYUNKNW = col_character(),
  XEYTOTLM = col_character(),
  XEYTOTLW = col_character(),
  XEYNRALT = col_character(),
  XFYRAC18 = col_character(),
  XFYRAC19 = col_character(),
  XFYRAC20 = col_character()
  # ... with 40 more columns
)
See spec(...) for full column specifications.
df_completions_18 = read_csv('c2018_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_17 = read_csv('c2017_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_16 = read_csv('c2016_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_15 = read_csv('c2015_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_14 = read_csv('c2014_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_13 = read_csv('c2013_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_12 = read_csv('c2012_a.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  CIPCODE = col_character(),
  AWLEVEL = col_character(),
  XCTOTALT = col_character(),
  XCTOTALM = col_character(),
  XCTOTALW = col_character(),
  XCAIANT = col_character(),
  XCAIANM = col_character(),
  XCAIANW = col_character(),
  XCASIAT = col_character(),
  XCASIAM = col_character(),
  XCASIAW = col_character(),
  XCBKAAT = col_character(),
  XCBKAAM = col_character(),
  XCBKAAW = col_character(),
  XCHISPT = col_character(),
  XCHISPM = col_character(),
  XCHISPW = col_character(),
  XCNHPIT = col_character(),
  XCNHPIM = col_character(),
  XCNHPIW = col_character()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_11 = read_csv('c2011_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CTOTALT = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CAIANT = col_double(),
  CAIANM = col_double(),
  CAIANW = col_double(),
  CASIAT = col_double(),
  CASIAM = col_double(),
  CASIAW = col_double(),
  CBKAAT = col_double(),
  CBKAAM = col_double(),
  CBKAAW = col_double(),
  CHISPT = col_double(),
  CHISPM = col_double(),
  CHISPW = col_double(),
  CNHPIT = col_double(),
  CNHPIM = col_double(),
  CNHPIW = col_double()
  # ... with 12 more columns
)
See spec(...) for full column specifications.
df_completions_10 = read_csv('c2010_a.csv')
Parsed with column specification:
cols(
  .default = col_character(),
  UNITID = col_double(),
  MAJORNUM = col_double(),
  CNRALM = col_double(),
  CNRALW = col_double(),
  CRACE03 = col_double(),
  CRACE04 = col_double(),
  CRACE05 = col_double(),
  CRACE06 = col_double(),
  CRACE07 = col_double(),
  CRACE08 = col_double(),
  CRACE09 = col_double(),
  CRACE10 = col_double(),
  CRACE11 = col_double(),
  CRACE12 = col_double(),
  CUNKNM = col_double(),
  CUNKNW = col_double(),
  CTOTALM = col_double(),
  CTOTALW = col_double(),
  CNRALT = col_double(),
  CRACE18 = col_double()
  # ... with 42 more columns
)
See spec(...) for full column specifications.
df_hd_14 <- read_csv('hd2014.csv')
Parsed with column specification:
cols(
  .default = col_double(),
  INSTNM = col_character(),
  ADDR = col_character(),
  CITY = col_character(),
  STABBR = col_character(),
  ZIP = col_character(),
  CHFNM = col_character(),
  CHFTITLE = col_character(),
  EIN = col_character(),
  OPEID = col_character(),
  WEBADDR = col_character(),
  ADMINURL = col_character(),
  FAIDURL = col_character(),
  APPLURL = col_character(),
  NPRICURL = col_character(),
  VETURL = col_character(),
  ATHURL = col_character(),
  ACT = col_character(),
  CLOSEDAT = col_character(),
  IALIAS = col_character(),
  F1SYSNAM = col_character()
  # ... with 1 more columns
)
See spec(...) for full column specifications.

These are fairly large tables. For example, df_effy has 16688 rows and df_hd has 7687. Lets take a peek:

head(df_effy_14)
head(df_hd_14)
head(df_completions_14)

The documentation for the data files explains the somewhat cryptic abbreviations used for column labels. For the EFFY surveys, the really important columns are UNITID (institutional ID number), EFYTOTLT (grand total fall enrollment), EFYTOTLM (total men), EFTYTOTLW (total women), LSTUDY (level of study – undergraduate 1, graduate 3, or combined 999).

Let’s get some figures from some of these tables.

enrollment_2018 <-select(df_effy_18,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2017 <-select(df_effy_17,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2016 <-select(df_effy_16,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2015 <-select(df_effy_15,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2014 <-select(df_effy_14,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2013 <-select(df_effy_13,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2012 <-select(df_effy_12,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2011 <-select(df_effy_11,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2010 <-select(df_effy_10,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)

Lets rename some of the data in these tables

enrollment_2018 <- rename(enrollment_2018,"university_id" = UNITID, "total_enrollment_2018"=EFYTOTLT,"men_enrollment_2018"=EFYTOTLM, "women_enrollment_2018"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2017 <- rename(enrollment_2017,"university_id" = UNITID, "total_enrollment_2017"=EFYTOTLT,"men_enrollment_2017"=EFYTOTLM, "women_enrollment_2017"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2016 <- rename(enrollment_2016,"university_id" = UNITID, "total_enrollment_2016"=EFYTOTLT,"men_enrollment_2016"=EFYTOTLM, "women_enrollment_2016"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2015 <- rename(enrollment_2015,"university_id" = UNITID, "total_enrollment_2015"=EFYTOTLT,"men_enrollment_2015"=EFYTOTLM, "women_enrollment_2015"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2014 <- rename(enrollment_2014,"university_id" = UNITID, "total_enrollment_2014"=EFYTOTLT,"men_enrollment_2014"=EFYTOTLM, "women_enrollment_2014"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2013 <- rename(enrollment_2013,"university_id" = UNITID,"total_enrollment_2013"=EFYTOTLT,"men_enrollment_2013"=EFYTOTLM, "women_enrollment_2013"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2012 <- rename(enrollment_2012,"university_id" = UNITID,"total_enrollment_2012"=EFYTOTLT,"men_enrollment_2012"=EFYTOTLM, "women_enrollment_2012"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2011 <- rename(enrollment_2011,"university_id" = UNITID,"total_enrollment_2011"=EFYTOTLT,"men_enrollment_2011"=EFYTOTLM, "women_enrollment_2011"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2010 <- rename(enrollment_2010,"university_id" = UNITID,"total_enrollment_2010"=EFYTOTLT,"men_enrollment_2010"=EFYTOTLM, "women_enrollment_2010"=EFYTOTLW,"level_study"=LSTUDY)

Let’s take a gander at these relabeled data sets

head(enrollment_2018)

We can see that the enrollment dataframes represent each college three times, split up by level of study. (We could further tidy the data so that it was split up by level of study AND gender, but we’ve left it this way for now.)

The next step is to combine all the enrollment data into a single data frame. We accomplish this with a pipeline of joins on the university id and level of study. The idea is that in the total_enrollment table each university is represented by three rows: one with undergraduate enrollment, one with graduate enrollment, and one with combined enrollment. Each row has data for the 9 years in question (2010-2018), which is further split up by gender.

total_enrollment <- enrollment_2010 %>%
  inner_join(enrollment_2011,by=c("university_id","level_study")) %>%
  inner_join(enrollment_2012, by =c("university_id","level_study"))%>%
  inner_join(enrollment_2013,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2014,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2015,by=c("university_id","level_study")) %>%
  inner_join(enrollment_2016, by =c("university_id","level_study"))%>%
  inner_join(enrollment_2017,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2018,by=c("university_id","level_study"))
head(total_enrollment)

This table is quite nice – it contains the data for each university. Let’s split it up into combined/undergrad/grad tables. I mainly did this just as an exercise in pipeline and filtering.


total_enrollment_combined <- total_enrollment %>%
  filter(level_study==999) %>%
  select(-level_study)

total_enrollment_undergrad <- total_enrollment %>%
  filter(level_study==1)%>%
  select(-level_study)

total_enrollment_grad <- total_enrollment %>%
  filter(level_study==3)%>%
  select(-level_study)

Now if we examine our new tables they have the data nicely divided:

head(total_enrollment_combined)
head(total_enrollment_undergrad)
head(total_enrollment_grad)

We can now do a bit of data exploration toward answering the questions listed at the beginning of the document. Simple summary statistics like average enrollment will be somewhat limited in application because the institutions vary a lot in size. Averaging changes between institutions of different sizes is also a somewhat shaky approach – a 2 percent dip in a huge university might represent a much bigger change than a 20 percent dip in a tiny college. Therefore the best idea seems to be to combine all numbers across all the schools.

enrollment_combined_no_gender <- total_enrollment_combined %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,total_enrollment_2016,total_enrollment_2017,
         total_enrollment_2018)
combined_enrollment_figs <- colSums(enrollment_combined_no_gender)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_figs_df <- data.frame(years,combined_enrollment_figs)
combined_enrollment_viz <- ggplot(data=combined_enrollment_figs_df,aes(x=years,y=combined_enrollment_figs))+
  geom_point()+
  geom_line() +
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Combined Enrollment (2010-2018)",x="Year",y="Combined enrollment (all institutions)")
ggsave(filename="combined_enrollment_viz.png",plot=combined_enrollment_viz,dpi=800)
Saving 7 x 7 in image
combined_enrollment_viz

Okay: the upshot is that combined (undergrad + grad) enrollment is going down.

Now let’s look at undergrad enrollment over the same period:

enrollment_undergrad_no_gender <- total_enrollment_undergrad %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)

undergrad_enrollment_figs <- colSums(enrollment_undergrad_no_gender)

years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
undergrad_enrollment_figs_df <- data.frame(years,undergrad_enrollment_figs)
undergrad_enrollment_viz <- ggplot(data=undergrad_enrollment_figs_df,aes(x=years,y=undergrad_enrollment_figs))+
  geom_point()+
  geom_line() +theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Undergrad Enrollment (2010-2018)",x="Year",y="Undergrad enrollment (all institutions)")
ggsave(filename="undergrad_enrollment_viz.png",plot=undergrad_enrollment_viz,dpi=800)
Saving 7 x 7 in image
undergrad_enrollment_viz

The undergraduate trend follows the same trend as the combined trend (no surprises there).

It’s less important, but we can also take a look at graduate enrollment over the same period.

enrollment_grad_no_gender <- total_enrollment_grad %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)
grad_enrollment_figs <- colSums(enrollment_grad_no_gender)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
grad_enrollment_figs_df <- data.frame(years,grad_enrollment_figs)
grad_enrollment_viz <- ggplot(data=grad_enrollment_figs_df,aes(x=years,y=grad_enrollment_figs))+
  geom_point()+
  geom_line() +
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Graduate Enrollment (2010-2018)",x="Year",y="Graduate enrollment (all institutions)")
ggsave(filename="grad_enrollment_viz.png",plot=grad_enrollment_viz,dpi=800)
Saving 7 x 7 in image
grad_enrollment_viz

If we juxtapose the three images, we see that the loss in overall enrollment seems to be driven by loss in undergraduate enrollment, and that the graduate enrollment rallied slightly starting in 2013. Since graduate enrollment represents about 10 percent of the combined enrollment, the rally in graduate enrollment does not balance out the diminishing undergraduate enrollment.

combined_enrollment_viz

undergrad_enrollment_viz

grad_enrollment_viz

The overall trend is: undergraduate enrollment is going down.

The EFFY survey data is split up by gender, so we can try to see if there are any interesting trends by gender.

enrollment_combined_men <- total_enrollment_combined %>%
  select(men_enrollment_2010,men_enrollment_2011,men_enrollment_2012,
         men_enrollment_2013,men_enrollment_2014,men_enrollment_2015,
         men_enrollment_2016,men_enrollment_2017,men_enrollment_2018)
combined_men_enrollment_figs <- colSums(enrollment_combined_men)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_men_figs_df <- data.frame(years,combined_men_enrollment_figs)
combined_enrollment_men_viz <- ggplot(data=combined_enrollment_men_figs_df,aes(x=years,y=combined_men_enrollment_figs))+
  geom_point()+
  geom_line() +
  labs(title="Combined Mens Enrollment (2010-2018)",x="Year",y="Combined mens enrollment (all institutions)")+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())
ggsave(filename="combined_men_enrollment_viz.png",plot=combined_enrollment_men_viz,dpi=800)
Saving 7 x 7 in image
combined_enrollment_men_viz

Not exactly a shock: the combined mens enrollment is going down quite sharply (a bit more sharply than the mens+womens combined enrollment).

enrollment_combined_women <- total_enrollment_combined %>%
  select(women_enrollment_2010,women_enrollment_2011,women_enrollment_2012,
         women_enrollment_2013,women_enrollment_2014,women_enrollment_2015,
         women_enrollment_2016,women_enrollment_2017,women_enrollment_2018)
combined_women_enrollment_figs <- colSums(enrollment_combined_women)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_women_figs_df <- data.frame(years,combined_women_enrollment_figs)
combined_enrollment_women_viz <- ggplot(data=combined_enrollment_women_figs_df,aes(x=years,y=combined_women_enrollment_figs))+
  geom_point()+
  geom_line() +
  labs(title="Combined Womens Enrollment (2010-2018)",x="Year",y="Combined womens enrollment (all institutions)")+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())
ggsave(filename="combined_women_enrollment_viz.png",plot=combined_enrollment_women_viz,dpi=800)
Saving 7 x 7 in image
combined_enrollment_women_viz

This gives us a clue that a slight recovery in enrollment has been driven by women returning to college before men.

Now we want this to include some other data: geographical region (this is from the HD data set) and discplines.

regions = select(df_hd_14,"university_id"=UNITID,"region_code"=OBEREG)
head(regions)

The university_id’s are familiar to us, and the region_code’s are as follows (this can be found on the Excel files included in the data download).

0 - US Service schools (such as West Point or the Merchant Marine Academy) 1 - New England (CT ME MA NH RI VT) 2 - Mid East (DE DC MD NJ NY PA) 3 - Great Lakes (IL IN MI OH WI) 4 - Plains (IA KS MN MO NE ND SD) 5 - Southeast (AL AR FL GA KY LA MS NC SC TN VA WV) 6 - Southwest (AZ NM OK TX) 7 - Rocky Mountains (CO ID MT UT WY) 8 - Far West (AK CA HI NV OR WA) 9 - Outlying areas (AS FM GU MH MP PR PW VI) -3 - Not available

Sidenote: it’s debatable if these regions are ideal. New England is well-defined and conventional, but, e.g., bundling Minnesota with Kansas instead of with Wisconsin is maybe debatable.

Now we want to combine the regions data with the enrollment data. The following code combines the enrollment data from the EFFY survey with the region codes so that we can analyze regional trends.

total_enrollment_regions <- total_enrollment %>%
  inner_join(regions, by = "university_id")
combined_enrollment_regions <- total_enrollment_regions %>%
  filter(level_study==999)


combined_no_gender_enrollment_regions <- combined_enrollment_regions %>%
  select(region_code,total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)

We haven’t totally tidied the data, but it’s good enough for our purposes (the ideal tidied data table contains a single measurement in each row, so we wouldn’t have enrollment figures from multiple years in the same row). The following chunk serves to find the total enrollment per region as a function of the year.

combined_no_gender_enrollment_regions_by_regions <- combined_no_gender_enrollment_regions %>%
  group_by(region_code)%>%
  transmute("2010"=sum(total_enrollment_2010),
         "2011"=sum(total_enrollment_2011),
         "2012"=sum(total_enrollment_2012),
         "2013"=sum(total_enrollment_2013),
         "2014"=sum(total_enrollment_2014),
         "2015"=sum(total_enrollment_2015),
         "2016"=sum(total_enrollment_2016),
         "2017"=sum(total_enrollment_2017),
         "2018"=sum(total_enrollment_2018)) %>%
  distinct(region_code,.keep_all = TRUE) %>%
  arrange(region_code)
reshaped_regions_no_gender <- combined_no_gender_enrollment_regions_by_regions %>%
  gather('2010','2011','2012','2013','2014','2015','2016','2017','2018',key='year',value='enrollment')

Lets take a look at this new table: it contains only three variables (region_code, year, and enrollment), and each row represents a single enrollment figure.

head(reshaped_regions_no_gender)

Let’s get into the plotting of this data. First we make a dataframe.

reshaped_regions_no_gender_df <- data.frame(reshaped_regions_no_gender)
reshaped_regions_no_gender_df

The region code 0 stands for US Service Academies like West Point, where enrollment is basically constant (as well as being a tiny fraction of any of the other regions). The same thing is true for the outlying areas such as Guam. So we might as well filter it out

regions_no_gender_df <- 
  reshaped_regions_no_gender_df %>%
  filter(region_code != 0 & region_code != 9)
head(regions_no_gender_df)

Now we can create our list of objects.

labels_regions <-  c('New England (CT ME MA NH RI VT)', 
                     'Mid East (DE DC MD NJ NY PA)',
                     'Great Lakes (IL IN MI OH WI)', 
                     'Plains (IA KS MN MO NE ND SD)',             
                     'Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)',
                     'Southwest (AZ NM OK TX)',
                     'Rocky Mountains (CO ID MT UT WY)', 
                     'Far West (AK CA HI NV OR WA)')

Let’s build a more concise set of labels:

labels_regions_concise <- c('New England', 
                            'Mid East',
                            'Great Lakes', 
                            'Plains', 
                            'Southeast',
                            'Southwest',
                            'Rocky Mountains', 
                            'Far West')

The following is a ggplot visualization.

regions_viz <- ggplot(data=regions_no_gender_df,aes(x=strtoi(year)-2000,y=enrollment/10^6,color=factor(region_code),group=factor(region_code),label=factor(region_code)))+
  geom_point()+
  geom_line(key_glyph="rect")+
  theme_light()+
  theme(legend.title = element_text(color="chocolate",size=16,face="bold"),
        legend.background = element_rect(fill="white",
                                  size=0.5, linetype="solid", 
                                  colour ="darkblue"),
        legend.key = element_rect(fill = "white", colour = "black"),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  scale_color_hue(name="Region",
                       labels=labels_regions_concise)+
  labs(y="Enrollment (in millions)",x="Year",title="Regional enrollment trends: 2010-2018")
ggsave(filename="regional_trends.png",plot=regions_viz,dpi=800)
Saving 7 x 7 in image
regions_viz

That gives us a picture of some of the trends.

New England is static.

Far West, after a dip in 2013, seems to be recovering.

Rocky Mountains is increasing at a modest but steady rate.

The Great Lakes region is decreasing.

=======DISCIPLINARY TRENDS========

Okay, next we would like to get some disciplinary information added in. Remember that these are stored in the completions tables:

head(df_completions_14)

The important columns here are UNITID (university ID code), CIPCODE (discplinary subject), CTOTALT (total completions in the discipline).

We have a list of CIPCODEs of the form xx.xxxx. The first two digits store the broad region (such as Biology or Education) and the last four indicate the specialty (such as Biochemistry or Wildlife Biology).

At first, we want to study broad trends, so we are only interested in the trends at the xx.-level. We need to split up the CIP code by the ‘.’ delimiter and then grab the first part.

completions_split_10 <- df_completions_10 %>%
  mutate(coarsecip = substr(CIPCODE,1,2),
         subcip = substr(CIPCODE,2,2))
completions_coarse_10 <- completions_split_10 %>%
  select(UNITID,coarsecip,CTOTALT)

Lets see if that worked:

head(completions_coarse_10)

Not too bad! But we only want the totals grouped by CIP – we don’t really care about the UNITID:

completions_coarse_by_cip_10 <- completions_coarse_10 %>%
  group_by(coarsecip) %>%
  summarize("2010" = sum(CTOTALT))
  completions_coarse_by_cip_10

That’s what we wanted: the total number of completed degrees by CIPCode. In the chunk below we have repeated this operation for each of the other years in our dataset.

completions_split_11 <- df_completions_11 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_11 <- completions_split_11 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2011" = sum(CTOTALT))
  
completions_split_12 <- df_completions_12 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_12 <- completions_split_12 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2012" = sum(CTOTALT))

completions_split_13 <- df_completions_13 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_13 <- completions_split_13 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2013" = sum(CTOTALT))

completions_split_14 <- df_completions_14 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_14 <- completions_split_14 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2014" = sum(CTOTALT))

completions_split_15 <- df_completions_15 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_15 <- completions_split_15 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2015" = sum(CTOTALT))

completions_split_16 <- df_completions_16 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_16 <- completions_split_16 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2016" = sum(CTOTALT))

completions_split_17 <- df_completions_17 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_17 <- completions_split_17 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2017" = sum(CTOTALT))

completions_split_18 <- df_completions_18 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_18 <- completions_split_18 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2018" = sum(CTOTALT))

We now have reasonably organized completion data for all the programs. The next step is to combine it into a single data frame.

We want the coarse_cip completion data to be organized into a single table.

all_completions_cip <- completions_coarse_by_cip_10 %>%
  merge(completions_coarse_by_cip_11) %>%
  merge(completions_coarse_by_cip_12) %>%
  merge(completions_coarse_by_cip_13) %>%
  merge(completions_coarse_by_cip_14) %>%
  merge(completions_coarse_by_cip_15) %>%
  merge(completions_coarse_by_cip_16) %>%
  merge(completions_coarse_by_cip_17) %>%
  merge(completions_coarse_by_cip_18)
all_completions_cip 

This is nice, but we’d really like to have the years as variables, in order to comply with the tidy data paradigm. (We could have done this earlier.)

reshaped_completions <- all_completions_cip %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
reshaped_completions

This is properly tidied coarse_cip completion data by year.

There is one that we would like to exclude: Cip code 99, which describes the grand total of all degree completions.

reshaped_completions <- reshaped_completions %>%
  filter(coarsecip != '99')

Now we can do a big facet graph, plotting for each CIPCODE a different line graph showing completions of degrees in that CIPCODE over the years 2010-2018.

WARNING: the first version is going to look very rough in RStudio. However the version that saves to the directory is not as cluttered.

bad_facets <- ggplot(data = reshaped_completions, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "Degree completions by CIP code (2010-2018)",
       y = "Count of degree completions", x = "year") + 
  facet_wrap(~coarsecip)

ggsave(filename='bad_facets.png',plot=bad_facets,dpi=800)
Saving 7 x 7 in image
bad_facets

This gives us some good trends (09, 11, 14, 24, 51, 52) all seem to be increasing. There is one thing we can do to make it a bit more meaningful: swap out cip codes for names (which I’ve summarized because the CIPcode descriptions are sometimes lengthy).

codes_number <- c(1,3,4,5,9,10,11,12,13,14,15,16,19,22,23,24,25,26,27, 29,30,31,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,54)
codes_chr <- as.character(codes_number)
cip_data <- data.frame("coarsecip" = codes_chr, "area" = c("agri","envsci","arch/urb","area studies","comms/journ","graph des.","comp.","beauty","educ","engns","technician","langs","human sci","law","lit","lib arts","library","bio","math","intel","gen studies","mgmt","phil","rel","chem","sci tech","psych","responder","admin","soc sci","trades","assorted tech","prod trade","air traff","art","health","bus/fin","history"))
cip_data

Let’s get the area names in there

reshaped_completions
reshaped_completions_areas <- reshaped_completions %>%
  merge(cip_data) %>%
  arrange(desc(coarsecip))
reshaped_completions_areas

Now lets do the same facet wrap:

discipline_viz <- ggplot(data = reshaped_completions_areas, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "Degree completions by area (2010-2018)",
       y = "Count of degree completions", x = "year") +
  
  facet_wrap(~area)+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank())
ggsave(filename='discipline_graph.png',plot=discipline_viz,dpi=400)
Saving 7 x 7 in image
discipline_viz

This is nice (it is tiny in Rstudio but I recommend looking at the saved image discipline_graph.png in the directory), but it washes out a lot of the trends for the lower-enrollment areas. So we might want to look at specific disciplines.

Let’s create a new dataframe that tracks something very coarse: percent growth in a field over the time interval 2010-2018. We calculate this as 100*(change in completions)/(2010 completions).

head(reshaped_completions_areas)
net_completions <-  reshaped_completions_areas %>% 
  filter(year == 2010 | year == 2018) %>%
  spread(year, completions) 
colnames(net_completions) <- c('coarsecip','area','comps_2010','comps_2018')
net_completions <- net_completions %>%
  mutate(percent_change = 100*(comps_2018/comps_2010-1)) %>%
  arrange(desc(percent_change))
head(net_completions,10)

We see that the 10 most increasing areas are those listed above To clarify these abbreviations: sci tech covers basically any degree preparing for work as a technician in a lab, comp. is computer science, prod trade is degrees related to production like manufacturing, mgmt is management science, engns is engineering.

Some of these are not surprising (nobody can be shocked by bronze medal for computer areas); some are more surprising (liberal arts has a lot more completions despite a lot of press about the death of liberal arts degrees).

Answer to one of our questions: A comparatively small investment in educational resources could return major benefits; also more math and computer science resources.

The visualization below makes this a bit more striking. The red line represents a 0 percent change in the number of degree completions.

percents_viz <- ggplot(data=net_completions,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Net change in degree completion: 2010-2018",x='area',y='percent change in enrollment')
percents_viz

This is a bit cluttered: let’s pin down the 10 fastest-growing and 10 fastest-shrinking areas.

ten_fastest <- net_completions %>%
  arrange(desc(percent_change)) %>%
  slice(1:10)
#ten_fastest

ten_worst <- net_completions %>%
  arrange(percent_change) %>%
  slice(1:10)
#ten_worst

fastest_viz <- ggplot(data=ten_fastest,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 30))+
  geom_hline(yintercept=0,color='red')+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  labs(title="Fastest areas of growth: 2010-2018",x='area',y='percent (+) change in enrollment')+
  ggsave('fastest_viz.png',dpi=800)
Saving 7 x 7 in image
slowest_viz <- ggplot(data=ten_worst,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 30))+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Weakest areas of growth: 2010-2018",x='area',y='percent change in enrollment')+
  ggsave('weakest_viz.png',dpi=800)
fastest_viz

slowest_viz

We saw that there was some good growth in the mathematics area. Anyone who follows higher-educational news in mathematics will know that statistics is growing in popularity as a major course of study. However, the Completions survey does not provide a separate “Coarse” CIP for stats; it’s stuck within the completions for math majors. Let’s see if we can extract some useful information about the completion of statistics degrees.

Remember what the non-coarse completions data looked like:

head(df_completions_10)

We still want to break apart the CIPCODE but now we want to filter to the Coarse CIP 27 – this is where all the mathematics is. Here are the fine CIPCODES for Mathematics (I started converting the codes to character by hand before I remembered you can cast to character vectorwise)

subfields_names <- c("Mathematics, General",
                     "Algebra and Number Theory",
                     "Analysis and Functional Analysis",
                     "Topology and Foundations",
                     "Mathematics, Other",
                     "Applied Mathematics (General)",
                     "Computational Mathematics",
                     "Computational and Applied Mathematics",
                     "Financial Mathematics",
                     "Mathematical Biology",
                     "Applied Mathematics (Other)",
                     "Statistics",
                     "Math-stats and probability",
                     "Math-stats",
                     "Statistics (Other)",
                     "Math-stats (Other)")
subfields_codes <- c(27.0101,
                     27.0102,
                     27.0103,
                     27.0105,
                     27.019,
                     27.0301,
                     27.0303,
                     27.0304,
                     27.0305,
                     27.0306,
                     27.0399,
                     27.0501,
                     27.0502,
                     27.0503,
                     27.0599,
                     27.9999)
subfields_codes <- as.character(subfields_codes)
subfields_df <- data.frame("subfield"=subfields_names,"CIPCODE"=subfields_codes)
subfields_df

If like me you find the list of subfields kind of ridiculous and redundant, don’t worry, we will simplify it later on, after we have summarized a bit.

We have the split dataframes. Here the math completions will have coarsecip = 27 and the subcip will indicate what specialty the degree was in (like stats or math-econ).

completions_split_10 %>% 
  select(UNITID,CTOTALT,coarsecip) %>%
  filter(coarsecip == '27')

These are quite long, with separate completion data by gender and ethnic lines.

completions_split_10 %>%
  filter(coarsecip == '27') %>%
  group_by(CIPCODE) %>%
  summarise("2010"=sum(CTOTALT))
math_completions_10 <- completions_split_10 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2010'=sum(CTOTALT))

math_completions_11 <- completions_split_11 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2011'=sum(CTOTALT))

math_completions_12 <- completions_split_12 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2012'=sum(CTOTALT))

math_completions_13 <- completions_split_13 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2013'=sum(CTOTALT))

math_completions_14 <- completions_split_14 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2014'=sum(CTOTALT))

math_completions_15 <- completions_split_15 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2015'=sum(CTOTALT))

math_completions_16 <- completions_split_16 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2016'=sum(CTOTALT))

math_completions_17 <- completions_split_17 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2017'=sum(CTOTALT))

math_completions_18 <- completions_split_18 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2018'=sum(CTOTALT))

Let’s take a peep:

math_completions_18

Now we have all the data for all the subfields for the years 2010-2018 stored in nine separate tables. Again we need to merge them

math_completions <- math_completions_10 %>%
  merge(math_completions_11,how='full') %>%
  merge(math_completions_12,how='full') %>%
  merge(math_completions_13,how='full') %>%
  merge(math_completions_14,how='full') %>%
  merge(math_completions_15,how='full') %>%
  merge(math_completions_16,how='full') %>%
  merge(math_completions_17,how='full') %>%
  merge(math_completions_18,how='full')
head(math_completions)

Now we’re cooking! We ust need to tidy up the data.

Let’s add in a column of all the subfield names, then tidy along the year axis.

math_completions <- math_completions %>%
  merge(subfields_df)
math_completions <- math_completions %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
math_completions

Before we visualize, it will be handy to have abbreviated names for each subfield.

subfields_df
abbr_math_subfield <- c('math',
                        'algebra',
                        'analysis',
                        'topology',
                        'mathother',
                        'applied',
                        'compmath',
                        'compmath+appl',
                        'finmath',
                        'mathbio',
                        'applmathoth',
                        'stat',
                        'mathstatprob',
                        'mathstat',
                        'statother',
                        'mathstatother')

Let’s update the subfields list with these abbreviated names:

subfields_df$abbrev <- abbr_math_subfield

In preparing this list, it jumps out at you how there are actually fewer real subfields: math, aaplied/computational math, stats.

Lets write a crude-subfield functions

crude_subfield <- function(t) {
  if (t %in%
      c('applied',
        'compmath',
        'compmath+appl',
        'finmath',
        'mathbio',
        'applmathoth')){
    'applied'
  }
  else if (t %in%
      c('math',
        'algebra',
        'analysis',
        'topology',
        'mathother')){
    'math'
  }
  else if (t %in% 
      c('stat',
        'mathstatprob',
        'mathstat',
        'statother',
        'mathstatother')){
    'stats'
  }
  }

Let’s test our function

crude_subfield('analysis')
[1] "math"
crude_subfield('statother')
[1] "stats"
crude_subfield('mathbio')
[1] "applied"

Now lets add a crude subfield column to the subfields dataframe

subfields_df_dec <- subfields_df %>% 
  mutate(crude = map_chr(abbrev,crude_subfield))
math_completions_dec <- math_completions %>%
  merge(subfields_df_dec, on = 'CIPCODE')
summarised_math_completions <- math_completions_dec %>%
  group_by(crude,year) %>%
  summarise('total_completions'=sum(completions))

Let’s visualize.

With the disclaimer that these are pretty crude bins, you can see strong growth across the board, with the applied fields (which includes computational math, applied math, and math bio) growing the fastest in recent years.

Let’s look at percent changes:

net_math_completions <-  math_completions_dec %>% 
  filter(year == 2010 | year == 2018) %>%
  select(year,completions,abbrev,crude)
net_math_completions <- net_math_completions %>%
  spread(year,completions)
colnames(net_math_completions)=c('subfield','crude_subfield','math_comps_10','math_comps_18')
net_math_completions <- net_math_completions %>%
  mutate(percent_change = 100*(math_comps_18/(math_comps_10+1)-1), net_change = math_comps_18-math_comps_10) %>%
  arrange(desc(percent_change))
net_math_completions$percent_change
 [1] 3400.00000  994.00000  834.37500  481.81818  375.51020  233.58209
 [7]  158.10664  121.80451  104.52088   53.04054   43.48758  -22.66667
v <- net_math_completions$percent_change
v[1] <- 100
net_math_completions$percent_change <- v
net_math_completions

We can also do the crude versiomn of this.

net_math_crude_completions <- net_math_completions %>%
  group_by(crude_subfield) %>%
  summarise(total_completions_10 = sum(math_comps_10),
            total_completions_18 = sum(math_comps_18))

net_math_crude_completions <- net_math_crude_completions %>%
  mutate(net_change = total_completions_18-total_completions_10,
         pct_change = (total_completions_18-total_completions_10)/total_completions_10)

Let’s visualize these growth rates:

pct_math_scatter <- ggplot(data=net_math_completions,aes(x=subfield,y=percent_change))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.ticks = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Math subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('math_fastest_viz.png',dpi=800)+
  coord_flip()
Saving 7 x 7 in image
pct_math_scatter

pct_math_scatter <- ggplot(data=net_math_completions,aes(x=subfield,y=net_change))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.ticks = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Math subfield growth (net): 2010-2018",x='area',y='net change in degree completions')+
  ggsave('math_fastest_viz.png',dpi=800)+
  coord_flip()
Saving 6.28 x 3.88 in image

pct_math_scatter

Let’s look at how the distributions have shifted over the period. (I know that pie charts have been deprecated )

Let’s do some tidying

Lets crudify it and switch to proportions.

ggsave("crude_bar_math.png",crude_bar_math,dpi=800)
Saving 7 x 7 in image
ggsave("crude_bar_math.png",crude_bar_math,dpi=800)
crude_bar_math

This shows that stats and applied fields are really growing within the math major, although we have already seen that there is a considerable amount of growth in general mathematics degrees. Any company that is trying to sell books and software to these students should look into publishing more applied and statistics titles.

======== UNFINISHED ANALYSIS OF OF COMPLETIONS DATA FOR COMPUTER SCIENCE ====

Below we sketch how to begin carrying out the same analysis for the Computer Science category (CoarseCIP: 11). The interested reader can update some of the commands and finish it.

subfields_names <- c("Gen CS","AI","IT","Informatics","CS (Other)",
                     "Gen program","Spec. program",
                     "Cert program", "Other program",
                     "Data processing","Info sci",
                     "Comp sys analyst","Data entry",
                     "Word proc","Other data entry",
                     "CS","Web page design","Data model",
                     "Graphics","Modelling","Software",
                     "Networks/Telecom","Network admin",
                     "LAN admin","Security","Webmaster",
                     "IT Project","Comp supp","IT Admin",
                     "Other CS Supp")
subfields_codes <- c(11.0101, 11.0102, 11.0103, 11.0104,
                     11.0199, 11.0201, 11.0202, 11.0203,
                     11.0299, 11.0301, 11.0401, 11.0501,
                     11.0601, 11.0602, 11.0699, 11.0701,
                     11.0801, 11.0802, 11.0803, 11.0804,
                     11.0899, 11.0901, 11.1001, 11.1002,
                     11.1003, 11.1004, 11.1005, 11.1006,
                     11.1099, 11.9999)
subfields_codes <- as.character(subfields_codes)
subfields_df <- data.frame("subfield"=subfields_names,"CIPCODE"=subfields_codes)
subfields_df

cs_completions_10 <- df_completions_10 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2010'=sum(CTOTALT))

cs_completions_11 <- df_completions_11 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2011'=sum(CTOTALT))

cs_completions_12 <- df_completions_12 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2012'=sum(CTOTALT))

cs_completions_13 <- df_completions_13 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2013'=sum(CTOTALT))

cs_completions_14 <- df_completions_14 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2014'=sum(CTOTALT))

cs_completions_15 <- df_completions_15 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2015'=sum(CTOTALT))

cs_completions_16 <- df_completions_16 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2016'=sum(CTOTALT))

cs_completions_17 <- df_completions_17 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2017'=sum(CTOTALT))

cs_completions_18 <- df_completions_18 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2018'=sum(CTOTALT))
cs_completions <- cs_completions_10 %>%
  merge(cs_completions_11,how='full') %>%
  merge(cs_completions_12,how='full') %>%
  merge(cs_completions_13,how='full') %>%
  merge(cs_completions_14,how='full') %>%
  merge(cs_completions_15,how='full') %>%
  merge(cs_completions_16,how='full') %>%
  merge(cs_completions_17,how='full') %>%
  merge(cs_completions_18,how='full')


cs_completions <- cs_completions %>%
  merge(subfields_df) %>%
  select(-CIPCODE)

cs_completions <- cs_completions %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
cs_completions

Finally we can visualize:

cs_subfield_viz <-  ggplot(data = cs_completions, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "CS degrees by subfield (2010-2018)",
       y = "Count of degree completions", x = "year") + 
  facet_wrap(~subfield)
ggsave('discipline_viz.png',dpi=400)
cs_subfield_viz
net_cs_completions <-  cs_completions %>% 
  filter(year == 2010 | year == 2018) %>%
  spread(year, completions)
colnames(net_cs_completions) <- c("subfield","cs_comps_2010","cs_comps_2018")
net_cs_completions <- net_cs_completions %>%
  mutate(percent_change = 100*(cs_comps_2018/cs_comps_2010-1), net_change = cs_comps_2018-cs_comps_2010) %>%
  arrange(desc(percent_change))
net_cs_completions$percent_change
v <- net_cs_completions$percent_change
v[1] <- 100
net_cs_completions$percent_change <- v
net_cs_completions

Let’s visualize these growth rates:

pct_cs_scatter <- ggplot(data=net_cs_completions,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="CS subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_overall_pct_viz.png',dpi=400)+
  coord_flip()
pct_cs_scatter

pct_cs_scatter <- ggplot(data=net_cs_completions,aes(x=subfield,y=net_change))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  geom_label(aes(label=floor(net_change)))+
  labs(title="CS subfield growth (net): 2010-2018",x='area',y='net change in degree completions')+
  ggsave('cs_overall_net_viz.png',dpi=400)+
  coord_flip()
pct_cs_scatter

Let’s filter this down a bit

(best_net_cs <- net_cs_completions %>%
  arrange(desc(net_change)) %>%
  slice(1:10))

(best_pct_cs <- net_cs_completions %>%
  arrange(desc(percent_change)) %>%
  slice(1:10))
best_net_cs_scatter <- ggplot(data=best_net_cs,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="Best CS subfield growth (by net): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_fastest_net_viz.png',dpi=400)+
  coord_flip()
best_net_cs_scatter

best_pct_cs_scatter <- ggplot(data=best_pct_cs,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="Best CS subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_fastest_pct_viz.png',dpi=400)+
  coord_flip()
best_pct_cs_scatter

It’s worth noting what the different codes are:

Data modeling includes Data Warehousing and Database Admin

Modelling includs Virtual Environments and Simulation

Many informatics degrees are budding into data science disciplines (e.g. at the University of Washington).

---
title: "Enrollment Trends"
output: html_notebook
---

HIGHER EDUCATION EXPLORATORY DATA ANALYSIS AND VISUALIZATION

============

INTRODUCTION

============

============
QUESTIONS 
============

The questions that I'm trying to answer with this project.

1. What are the trends in higher-educational enrollment?
2. Are the trends regional or discipline specific?
3. Where should higher-education content firms invest in salesforce and content development?


============
DATA SOURCES
============

I used data from the Integrated Postsecondary Education Data System (IPEDS), which is available at https://nces.ed.gov/ipeds/use-the-data. IPEDS has data from a variety of surveys but I used the following three surveys, accessible from the drop-down menu on the website.

Fall enrollment (EFFY). This survey reports the number of enrollments in the fall semester for each responding school, college, university, etc. We use this for overall trends and also for regional trends.  

Completions. This reports the number of degree completions for each institution by department (math, biology, English, etc.) and also by subdiscipline (so for mathematics the data splits into applied math, pure math, math-econ, etc.). We use this to detect trends at the level of academic discipline.

Institutional characteristics - directory information (HD). This records the region for each school. We use this to detect regional trends. 

============
CONCLUSIONS
============

The answers (see below for the data analysis supporting these):

1. Overall enrollment is down from 2013-2018. If you divide the enrollment data by gender, men's enrollment has declined much more sharply than women's enrollment. The data shows a slight recovery in enrollments happening around 2018, but the missing 2019 data might not suppor this, and there is the large issue of the pandemic affecting enrollment starting in 2020 

Somewhat surprisingly, there is still growth in degree completions for many disciplines, in spite of the overall downward trend in enrollment. I'm not sure how to account for this, but my guess is that in any year a large number of students enroll in college who will eventually drop out. It seems possible that downward trend in enrollment trends could eat into this group of students first: they might have less money for college, be on the fence about whether to go to college versus enter the workforce, have less preparation for college, etc. Any factor that would later contribute to a student dropping out of college might plausibly lead to them not enrolling in the first place, particularly during a time of economic uncertainty (or even during a time of economic growth, when non-degree-specific jobs are available). If this group is large enough, they could account for diminishing enrollments while other students, not as apt to drop out of college after enrolling, drive a smaller upward trend in degree completions for many academic subdisciplines. 

An alternate explanation is that some universities have lowered their academic standards in response to diminishing enrollments, in a bid to lose fewer students before degree completion. With the given data I am uncertain of how to choose which of these (or both, or neither) is the correct explanation. 

2. The trends are apparently region and discipline-specific. Regionally the Far West region has seen growth and the Great Lakes region has seen decrease in enrollments, just to name two examples. At the disciplinary level, there has been dramatic decline in History degree completions accompanied by a rise in completions of technical and Computer Science degrees. 

3. Regionally, the Southeast and Far West are among the fastest growing areas. Any regional salesforce should be directed there. At the disciplinary level, Computer Science and Mathematics (including Statistics) are both growing very quickly. Within the Mathematics degree completion data we see a trend toward applied mathematics and statistics, both areas where content creation should be directed. There are several other disciplines that have seen strong growth, particularly scientific technician degrees. 

============
DATA ANALYSIS
============

Here we load a few standard tools for working with dataframes and visualizations in R. 

```{r}
library(dplyr)
library(tidyverse)
library(reshape2)
```

First, we want to load the data (stored on the disk as .csv files) into some data frames. We grab the EFFY and Completions surveys for each of the years 2010-2018. We only need one instance of the HD survey because this survey only contains regional characteristics. 

```{r}
df_effy_18 <- read_csv('effy2018.csv')
df_effy_17 <- read_csv('effy2017.csv')
df_effy_16 <- read_csv('effy2016.csv')
df_effy_15 <- read_csv('effy2015.csv')
df_effy_14 <- read_csv('effy2014.csv')
df_effy_13 <- read_csv('effy2013.csv')
df_effy_12 <- read_csv('effy2012.csv')
df_effy_11 <- read_csv('effy2011.csv')
df_effy_10 <- read_csv('effy2010.csv')

df_completions_18 = read_csv('c2018_a.csv')
df_completions_17 = read_csv('c2017_a.csv')
df_completions_16 = read_csv('c2016_a.csv')
df_completions_15 = read_csv('c2015_a.csv')
df_completions_14 = read_csv('c2014_a.csv')
df_completions_13 = read_csv('c2013_a.csv')
df_completions_12 = read_csv('c2012_a.csv')
df_completions_11 = read_csv('c2011_a.csv')
df_completions_10 = read_csv('c2010_a.csv')

df_hd_14 <- read_csv('hd2014.csv')
```

These are fairly large tables. For example, df_effy has 16688 rows and df_hd has 7687. Lets take a peek: 
```{r}
head(df_effy_14)
head(df_hd_14)
head(df_completions_14)
```

The documentation for the data files explains the somewhat cryptic abbreviations used for column labels. For the EFFY surveys, the really important columns are UNITID (institutional ID number), EFYTOTLT (grand total fall enrollment), EFYTOTLM (total men), EFTYTOTLW (total women), LSTUDY (level of study -- undergraduate 1, graduate 3, or combined 999). 

Let's get some figures from some of these tables. 

```{r}
enrollment_2018 <-select(df_effy_18,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2017 <-select(df_effy_17,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2016 <-select(df_effy_16,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2015 <-select(df_effy_15,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2014 <-select(df_effy_14,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2013 <-select(df_effy_13,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2012 <-select(df_effy_12,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2011 <-select(df_effy_11,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
enrollment_2010 <-select(df_effy_10,UNITID, EFYTOTLT,EFYTOTLM,EFYTOTLW,LSTUDY)
```

Lets rename some of the data in these tables
```{r}
enrollment_2018 <- rename(enrollment_2018,"university_id" = UNITID, "total_enrollment_2018"=EFYTOTLT,"men_enrollment_2018"=EFYTOTLM, "women_enrollment_2018"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2017 <- rename(enrollment_2017,"university_id" = UNITID, "total_enrollment_2017"=EFYTOTLT,"men_enrollment_2017"=EFYTOTLM, "women_enrollment_2017"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2016 <- rename(enrollment_2016,"university_id" = UNITID, "total_enrollment_2016"=EFYTOTLT,"men_enrollment_2016"=EFYTOTLM, "women_enrollment_2016"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2015 <- rename(enrollment_2015,"university_id" = UNITID, "total_enrollment_2015"=EFYTOTLT,"men_enrollment_2015"=EFYTOTLM, "women_enrollment_2015"=EFYTOTLW,"level_study"=LSTUDY)

enrollment_2014 <- rename(enrollment_2014,"university_id" = UNITID, "total_enrollment_2014"=EFYTOTLT,"men_enrollment_2014"=EFYTOTLM, "women_enrollment_2014"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2013 <- rename(enrollment_2013,"university_id" = UNITID,"total_enrollment_2013"=EFYTOTLT,"men_enrollment_2013"=EFYTOTLM, "women_enrollment_2013"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2012 <- rename(enrollment_2012,"university_id" = UNITID,"total_enrollment_2012"=EFYTOTLT,"men_enrollment_2012"=EFYTOTLM, "women_enrollment_2012"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2011 <- rename(enrollment_2011,"university_id" = UNITID,"total_enrollment_2011"=EFYTOTLT,"men_enrollment_2011"=EFYTOTLM, "women_enrollment_2011"=EFYTOTLW,"level_study"=LSTUDY)
enrollment_2010 <- rename(enrollment_2010,"university_id" = UNITID,"total_enrollment_2010"=EFYTOTLT,"men_enrollment_2010"=EFYTOTLM, "women_enrollment_2010"=EFYTOTLW,"level_study"=LSTUDY)
```
Let's take a gander at these relabeled data sets 
```{r}
head(enrollment_2018)
```

We can see that the enrollment dataframes represent each college three times, split up by level of study. (We could further tidy the data so that it was split up by level of study AND gender, but we've left it this way for now.)

The next step is to combine all the enrollment data into a single data frame. We accomplish this with a pipeline of joins on the university id and level of study. The idea is that in the total_enrollment table each university is represented by three rows: one with undergraduate enrollment, one with graduate enrollment, and one with combined enrollment. Each row has data for the 9 years in question (2010-2018), which is further split up by gender.

```{r}
total_enrollment <- enrollment_2010 %>%
  inner_join(enrollment_2011,by=c("university_id","level_study")) %>%
  inner_join(enrollment_2012, by =c("university_id","level_study"))%>%
  inner_join(enrollment_2013,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2014,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2015,by=c("university_id","level_study")) %>%
  inner_join(enrollment_2016, by =c("university_id","level_study"))%>%
  inner_join(enrollment_2017,by=c("university_id","level_study"))%>%
  inner_join(enrollment_2018,by=c("university_id","level_study"))
head(total_enrollment)
```

This table is quite nice -- it contains the data for each university. Let's split it up into combined/undergrad/grad tables. I mainly did this just as an exercise in pipeline and filtering. 

```{r}

total_enrollment_combined <- total_enrollment %>%
  filter(level_study==999) %>%
  select(-level_study)

total_enrollment_undergrad <- total_enrollment %>%
  filter(level_study==1)%>%
  select(-level_study)

total_enrollment_grad <- total_enrollment %>%
  filter(level_study==3)%>%
  select(-level_study)
```

Now if we examine our new tables they have the data nicely divided: 
```{r}
head(total_enrollment_combined)
head(total_enrollment_undergrad)
head(total_enrollment_grad)
```




We can now do a bit of data exploration toward answering the questions listed at the beginning of the document. Simple summary statistics like average enrollment will be somewhat limited in application because the institutions vary a lot in size. Averaging changes between institutions of different sizes is also a somewhat shaky approach -- a 2 percent dip in a huge university might represent a much bigger change than a 20 percent dip in a tiny college. Therefore the best idea seems to be to combine all numbers across all the schools. 


```{r}
enrollment_combined_no_gender <- total_enrollment_combined %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,total_enrollment_2016,total_enrollment_2017,
         total_enrollment_2018)
combined_enrollment_figs <- colSums(enrollment_combined_no_gender)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_figs_df <- data.frame(years,combined_enrollment_figs)
combined_enrollment_viz <- ggplot(data=combined_enrollment_figs_df,aes(x=years,y=combined_enrollment_figs))+
  geom_point()+
  geom_line() +
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Combined Enrollment (2010-2018)",x="Year",y="Combined enrollment (all institutions)")
ggsave(filename="combined_enrollment_viz.png",plot=combined_enrollment_viz,dpi=800)
combined_enrollment_viz
```

Okay: the upshot is that combined (undergrad + grad) enrollment is going down. 

Now let's look at undergrad enrollment over the same period: 

```{r}
enrollment_undergrad_no_gender <- total_enrollment_undergrad %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)

undergrad_enrollment_figs <- colSums(enrollment_undergrad_no_gender)

years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
undergrad_enrollment_figs_df <- data.frame(years,undergrad_enrollment_figs)
undergrad_enrollment_viz <- ggplot(data=undergrad_enrollment_figs_df,aes(x=years,y=undergrad_enrollment_figs))+
  geom_point()+
  geom_line() +theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Undergrad Enrollment (2010-2018)",x="Year",y="Undergrad enrollment (all institutions)")
ggsave(filename="undergrad_enrollment_viz.png",plot=undergrad_enrollment_viz,dpi=800)
undergrad_enrollment_viz
```

The undergraduate trend follows the same trend as the combined trend (no surprises there). 

It's less important, but we can also take a look at graduate enrollment over the same period. 
```{r}
enrollment_grad_no_gender <- total_enrollment_grad %>%
  select(total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)
grad_enrollment_figs <- colSums(enrollment_grad_no_gender)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
grad_enrollment_figs_df <- data.frame(years,grad_enrollment_figs)
grad_enrollment_viz <- ggplot(data=grad_enrollment_figs_df,aes(x=years,y=grad_enrollment_figs))+
  geom_point()+
  geom_line() +
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  labs(title="Graduate Enrollment (2010-2018)",x="Year",y="Graduate enrollment (all institutions)")
ggsave(filename="grad_enrollment_viz.png",plot=grad_enrollment_viz,dpi=800)
grad_enrollment_viz
```

If we juxtapose the three images, we see that the loss in overall enrollment seems to be driven by loss in undergraduate enrollment, and that the graduate enrollment rallied slightly starting in 2013. Since graduate enrollment represents about 10 percent of the combined enrollment, the rally in graduate enrollment does not balance out the diminishing undergraduate enrollment. 
```{r}
combined_enrollment_viz
undergrad_enrollment_viz
grad_enrollment_viz
```

The overall trend is: undergraduate enrollment is going down. 


The EFFY survey data is split up by gender, so we can try to see if there are any interesting trends by gender. 

```{r}
enrollment_combined_men <- total_enrollment_combined %>%
  select(men_enrollment_2010,men_enrollment_2011,men_enrollment_2012,
         men_enrollment_2013,men_enrollment_2014,men_enrollment_2015,
         men_enrollment_2016,men_enrollment_2017,men_enrollment_2018)
combined_men_enrollment_figs <- colSums(enrollment_combined_men)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_men_figs_df <- data.frame(years,combined_men_enrollment_figs)
combined_enrollment_men_viz <- ggplot(data=combined_enrollment_men_figs_df,aes(x=years,y=combined_men_enrollment_figs))+
  geom_point()+
  geom_line() +
  labs(title="Combined Mens Enrollment (2010-2018)",x="Year",y="Combined mens enrollment (all institutions)")+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())
ggsave(filename="combined_men_enrollment_viz.png",plot=combined_enrollment_men_viz,dpi=800)
combined_enrollment_men_viz
```

Not exactly a shock: the combined mens enrollment is going down quite sharply (a bit more sharply than the mens+womens combined enrollment). 

```{r}
enrollment_combined_women <- total_enrollment_combined %>%
  select(women_enrollment_2010,women_enrollment_2011,women_enrollment_2012,
         women_enrollment_2013,women_enrollment_2014,women_enrollment_2015,
         women_enrollment_2016,women_enrollment_2017,women_enrollment_2018)
combined_women_enrollment_figs <- colSums(enrollment_combined_women)
years <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
combined_enrollment_women_figs_df <- data.frame(years,combined_women_enrollment_figs)
combined_enrollment_women_viz <- ggplot(data=combined_enrollment_women_figs_df,aes(x=years,y=combined_women_enrollment_figs))+
  geom_point()+
  geom_line() +
  labs(title="Combined Womens Enrollment (2010-2018)",x="Year",y="Combined womens enrollment (all institutions)")+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())
ggsave(filename="combined_women_enrollment_viz.png",plot=combined_enrollment_women_viz,dpi=800)
combined_enrollment_women_viz
```
This gives us a clue that a slight recovery in enrollment has been driven by women returning to college before men. 




Now we want this to include some other data: geographical region (this is from the HD data set) and discplines. 
```{r}
regions = select(df_hd_14,"university_id"=UNITID,"region_code"=OBEREG)
head(regions)
```
The university_id's are familiar to us, and the region_code's are as follows (this can be found on the Excel files included in the data download). 

0 - US Service schools (such as West Point or the Merchant Marine Academy)
1 - New England (CT ME MA NH RI VT)
2 - Mid East (DE DC MD NJ NY PA)
3 - Great Lakes (IL IN MI OH WI)
4 - Plains (IA KS MN MO NE ND SD)
5 - Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)
6 - Southwest (AZ NM OK TX)
7 - Rocky Mountains (CO ID MT UT WY)
8 - Far West (AK CA HI NV OR WA)
9 - Outlying areas (AS FM GU MH MP PR PW VI)
-3 - Not available

Sidenote: it's debatable if these regions are ideal. New England is well-defined and conventional, but, e.g., bundling Minnesota with Kansas instead of with Wisconsin is maybe debatable.  

Now we want to combine the regions data with the enrollment data. 
The following code combines the enrollment data from the EFFY survey with the region codes so that we can analyze regional trends. 

```{r}
total_enrollment_regions <- total_enrollment %>%
  inner_join(regions, by = "university_id")
combined_enrollment_regions <- total_enrollment_regions %>%
  filter(level_study==999)


combined_no_gender_enrollment_regions <- combined_enrollment_regions %>%
  select(region_code,total_enrollment_2010,total_enrollment_2011,total_enrollment_2012,
         total_enrollment_2013,total_enrollment_2014,total_enrollment_2015,
         total_enrollment_2016,total_enrollment_2017,total_enrollment_2018)
```

We haven't totally tidied the data, but it's good enough for our purposes (the ideal tidied data table contains a single measurement in each row, so we wouldn't have enrollment figures from multiple years in the same row). The following chunk serves to find the total enrollment per region as a function of the year. 

```{r}
combined_no_gender_enrollment_regions_by_regions <- combined_no_gender_enrollment_regions %>%
  group_by(region_code)%>%
  transmute("2010"=sum(total_enrollment_2010),
         "2011"=sum(total_enrollment_2011),
         "2012"=sum(total_enrollment_2012),
         "2013"=sum(total_enrollment_2013),
         "2014"=sum(total_enrollment_2014),
         "2015"=sum(total_enrollment_2015),
         "2016"=sum(total_enrollment_2016),
         "2017"=sum(total_enrollment_2017),
         "2018"=sum(total_enrollment_2018)) %>%
  distinct(region_code,.keep_all = TRUE) %>%
  arrange(region_code)
reshaped_regions_no_gender <- combined_no_gender_enrollment_regions_by_regions %>%
  gather('2010','2011','2012','2013','2014','2015','2016','2017','2018',key='year',value='enrollment')
```


Lets take a look at this new table: it contains only three variables (region_code, year, and enrollment), and each row represents a single enrollment figure.  

```{r}
head(reshaped_regions_no_gender)
```

Let's get into the plotting of this data. First we make a dataframe. 
```{r}
reshaped_regions_no_gender_df <- data.frame(reshaped_regions_no_gender)
reshaped_regions_no_gender_df
```
The region code 0 stands for US Service Academies like West Point, where enrollment is basically constant (as well as being a tiny fraction of any of the other regions). The same thing is true for the outlying areas such as Guam. So we might as well filter it out 

```{r}
regions_no_gender_df <- 
  reshaped_regions_no_gender_df %>%
  filter(region_code != 0 & region_code != 9)
head(regions_no_gender_df)
```




Now we can create our list of objects. 


```{r}
labels_regions <-  c('New England (CT ME MA NH RI VT)', 
                     'Mid East (DE DC MD NJ NY PA)',
                     'Great Lakes (IL IN MI OH WI)', 
                     'Plains (IA KS MN MO NE ND SD)',             
                     'Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)',
                     'Southwest (AZ NM OK TX)',
                     'Rocky Mountains (CO ID MT UT WY)', 
                     'Far West (AK CA HI NV OR WA)')
```
Let's build a more concise set of labels: 
```{r}
labels_regions_concise <- c('New England', 
                            'Mid East',
                            'Great Lakes', 
                            'Plains', 
                            'Southeast',
                            'Southwest',
                            'Rocky Mountains', 
                            'Far West')
```


The following is a ggplot visualization. 
```{r}
regions_viz <- ggplot(data=regions_no_gender_df,aes(x=strtoi(year)-2000,y=enrollment/10^6,color=factor(region_code),group=factor(region_code),label=factor(region_code)))+
  geom_point()+
  geom_line(key_glyph="rect")+
  theme_light()+
  theme(legend.title = element_text(color="chocolate",size=16,face="bold"),
        legend.background = element_rect(fill="white",
                                  size=0.5, linetype="solid", 
                                  colour ="darkblue"),
        legend.key = element_rect(fill = "white", colour = "black"),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  scale_color_hue(name="Region",
                       labels=labels_regions_concise)+
  labs(y="Enrollment (in millions)",x="Year",title="Regional enrollment trends: 2010-2018")
ggsave(filename="regional_trends.png",plot=regions_viz,dpi=800)
regions_viz
```

That gives us a picture of some of the trends. 

New England is static.

Far West, after a dip in 2013, seems to be recovering. 

Rocky Mountains is increasing at a modest but steady rate.

The Great Lakes region is decreasing. 



=======DISCIPLINARY TRENDS========

Okay, next we would like to get some disciplinary information added in. Remember that these are stored in the completions tables: 
```{r}
head(df_completions_14)
```
The important columns here are UNITID (university ID code), CIPCODE (discplinary subject), CTOTALT (total completions in the discipline). 

We have a list of CIPCODEs of the form xx.xxxx. The first two digits store the broad region (such as Biology or Education) and the last four indicate the specialty (such as Biochemistry or Wildlife Biology). 

At first, we want to study broad trends, so we are only interested in the trends at the xx.-level. We need to split up the CIP code by the '.' delimiter and then grab the first part. 




```{r}
completions_split_10 <- df_completions_10 %>%
  mutate(coarsecip = substr(CIPCODE,1,2),
         subcip = substr(CIPCODE,2,2))
completions_coarse_10 <- completions_split_10 %>%
  select(UNITID,coarsecip,CTOTALT)
```
Lets see if that worked: 
```{r}
head(completions_coarse_10)
```
Not too bad! But we only want the totals grouped by CIP -- we don't really care about the UNITID: 
```{r}
completions_coarse_by_cip_10 <- completions_coarse_10 %>%
  group_by(coarsecip) %>%
  summarize("2010" = sum(CTOTALT))
  completions_coarse_by_cip_10
```
That's what we wanted: the total number of completed degrees by CIPCode. In the chunk below we have repeated this operation for each of the other years in our dataset. 

```{r}
completions_split_11 <- df_completions_11 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_11 <- completions_split_11 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2011" = sum(CTOTALT))
  
completions_split_12 <- df_completions_12 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_12 <- completions_split_12 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2012" = sum(CTOTALT))

completions_split_13 <- df_completions_13 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_13 <- completions_split_13 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2013" = sum(CTOTALT))

completions_split_14 <- df_completions_14 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_14 <- completions_split_14 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2014" = sum(CTOTALT))

completions_split_15 <- df_completions_15 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_15 <- completions_split_15 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2015" = sum(CTOTALT))

completions_split_16 <- df_completions_16 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_16 <- completions_split_16 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2016" = sum(CTOTALT))

completions_split_17 <- df_completions_17 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_17 <- completions_split_17 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2017" = sum(CTOTALT))

completions_split_18 <- df_completions_18 %>%
  mutate(coarsecip = substr(CIPCODE,1,2))
completions_coarse_by_cip_18 <- completions_split_18 %>%
  select(UNITID,coarsecip,CTOTALT) %>%
  group_by(coarsecip) %>%
  summarize("2018" = sum(CTOTALT))
```

We now have reasonably organized completion data for all the programs. The next step is to combine it into a single data frame.

We want the coarse_cip completion data to be organized into a single table. 

```{r}
all_completions_cip <- completions_coarse_by_cip_10 %>%
  merge(completions_coarse_by_cip_11) %>%
  merge(completions_coarse_by_cip_12) %>%
  merge(completions_coarse_by_cip_13) %>%
  merge(completions_coarse_by_cip_14) %>%
  merge(completions_coarse_by_cip_15) %>%
  merge(completions_coarse_by_cip_16) %>%
  merge(completions_coarse_by_cip_17) %>%
  merge(completions_coarse_by_cip_18)
all_completions_cip 
```

This is nice, but we'd really like to have the years as variables, in order to comply with the tidy data paradigm. (We could have done this earlier.)

```{r}
reshaped_completions <- all_completions_cip %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
reshaped_completions
```

This is properly tidied coarse_cip completion data by year. 

There is one that we would like to exclude: Cip code 99, which describes the grand total of all degree completions.

```{r}
reshaped_completions <- reshaped_completions %>%
  filter(coarsecip != '99')
```


Now we can do a big facet graph, plotting for each CIPCODE a different line graph showing completions of degrees in that CIPCODE over the years 2010-2018. 

WARNING: the first version is going to look very rough in RStudio. However the version that saves to the directory is not as cluttered. 

```{r}
bad_facets <- ggplot(data = reshaped_completions, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "Degree completions by CIP code (2010-2018)",
       y = "Count of degree completions", x = "year") + 
  facet_wrap(~coarsecip)

ggsave(filename='bad_facets.png',plot=bad_facets,dpi=800)
bad_facets
```

This gives us some good trends (09, 11, 14, 24, 51, 52) all seem to be increasing. There is one thing we can do to make it a bit more meaningful: swap out cip codes for names (which I've summarized because the CIPcode descriptions are sometimes lengthy). 

```{r}
codes_number <- c(1,3,4,5,9,10,11,12,13,14,15,16,19,22,23,24,25,26,27, 29,30,31,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,54)
codes_chr <- as.character(codes_number)
cip_data <- data.frame("coarsecip" = codes_chr, "area" = c("agri","envsci","arch/urb","area studies","comms/journ","graph des.","comp.","beauty","educ","engns","technician","langs","human sci","law","lit","lib arts","library","bio","math","intel","gen studies","mgmt","phil","rel","chem","sci tech","psych","responder","admin","soc sci","trades","assorted tech","prod trade","air traff","art","health","bus/fin","history"))
cip_data
```

Let's get the area names in there
```{r}
reshaped_completions
reshaped_completions_areas <- reshaped_completions %>%
  merge(cip_data) %>%
  arrange(desc(coarsecip))
reshaped_completions_areas
```
Now lets do the same facet wrap: 

```{r}
discipline_viz <- ggplot(data = reshaped_completions_areas, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "Degree completions by area (2010-2018)",
       y = "Count of degree completions", x = "year") +
  
  facet_wrap(~area)+
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank())
ggsave(filename='discipline_graph.png',plot=discipline_viz,dpi=400)

discipline_viz
```
This is nice (it is tiny in Rstudio but I recommend looking at the saved image discipline_graph.png in the directory), but it washes out a lot of the trends for the lower-enrollment areas. So we might want to look at specific disciplines. 

Let's create a new dataframe that tracks something very coarse: percent growth in a field over the time interval 2010-2018. We calculate this as 100*(change in completions)/(2010 completions).  
```{r}
head(reshaped_completions_areas)
net_completions <-  reshaped_completions_areas %>% 
  filter(year == 2010 | year == 2018) %>%
  spread(year, completions) 
colnames(net_completions) <- c('coarsecip','area','comps_2010','comps_2018')
net_completions <- net_completions %>%
  mutate(percent_change = 100*(comps_2018/comps_2010-1)) %>%
  arrange(desc(percent_change))
head(net_completions,10)
```
We see that the 10 most increasing areas are those listed above To clarify these abbreviations: sci tech covers basically any degree preparing for work as a technician in a lab, comp. is computer science, prod trade is degrees related to production like manufacturing, mgmt is management science, engns is engineering.

Some of these are not surprising (nobody can be shocked by bronze medal for computer areas); some are more surprising (liberal arts has a lot more completions despite a lot of press about the death of liberal arts degrees). 

Answer to one of our questions: A comparatively small investment in educational resources could return major benefits; also more math and computer science resources.



The visualization below makes this a bit more striking. The red line represents a 0 percent change in the number of degree completions. 
```{r}
percents_viz <- ggplot(data=net_completions,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Net change in degree completion: 2010-2018",x='area',y='percent change in enrollment')
percents_viz
```

This is a bit cluttered: let's pin down the 10 fastest-growing and 10 fastest-shrinking areas. 

```{r}
ten_fastest <- net_completions %>%
  arrange(desc(percent_change)) %>%
  slice(1:10)
#ten_fastest

ten_worst <- net_completions %>%
  arrange(percent_change) %>%
  slice(1:10)
#ten_worst

fastest_viz <- ggplot(data=ten_fastest,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 30))+
  geom_hline(yintercept=0,color='red')+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  labs(title="Fastest areas of growth: 2010-2018",x='area',y='percent (+) change in enrollment')+
  ggsave('fastest_viz.png',dpi=800)
slowest_viz <- ggplot(data=ten_worst,aes(x=area,y=percent_change))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 30))+
  theme(axis.text.x = element_text(angle = -90,vjust=0.45),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.background = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Weakest areas of growth: 2010-2018",x='area',y='percent change in enrollment')+
  ggsave('weakest_viz.png',dpi=800)
fastest_viz
slowest_viz
```
We saw that there was some good growth in the mathematics area. Anyone who follows higher-educational news in mathematics will know that statistics is growing in popularity as a major course of study. However, the Completions survey does not provide a separate "Coarse" CIP for stats; it's stuck within the completions for math majors. Let's see if we can extract some useful information about the completion of statistics degrees. 


Remember what the non-coarse completions data looked like: 

```{r}
head(df_completions_10)
```
We still want to break apart the CIPCODE but now we want to filter to the Coarse CIP 27 -- this is where all the mathematics is. Here are the fine CIPCODES for Mathematics (I started converting the codes to character by hand before I remembered you can cast to character vectorwise)
```{r}
subfields_names <- c("Mathematics, General",
                     "Algebra and Number Theory",
                     "Analysis and Functional Analysis",
                     "Topology and Foundations",
                     "Mathematics, Other",
                     "Applied Mathematics (General)",
                     "Computational Mathematics",
                     "Computational and Applied Mathematics",
                     "Financial Mathematics",
                     "Mathematical Biology",
                     "Applied Mathematics (Other)",
                     "Statistics",
                     "Math-stats and probability",
                     "Math-stats",
                     "Statistics (Other)",
                     "Math-stats (Other)")
subfields_codes <- c(27.0101,
                     27.0102,
                     27.0103,
                     27.0105,
                     27.019,
                     27.0301,
                     27.0303,
                     27.0304,
                     27.0305,
                     27.0306,
                     27.0399,
                     27.0501,
                     27.0502,
                     27.0503,
                     27.0599,
                     27.9999)
subfields_codes <- as.character(subfields_codes)
subfields_df <- data.frame("subfield"=subfields_names,"CIPCODE"=subfields_codes)
subfields_df
```

If like me you find the list of subfields kind of ridiculous and redundant, don't worry, we will simplify it later on, after we have summarized a bit. 


We have the split dataframes. Here the math completions will have coarsecip = 27 and the subcip will indicate what specialty the degree was in (like stats or math-econ).  
```{r}
completions_split_10 %>% 
  select(UNITID,CTOTALT,coarsecip) %>%
  filter(coarsecip == '27')
```
These are quite long, with separate completion data by gender and ethnic lines. 


```{r}
completions_split_10 %>%
  filter(coarsecip == '27') %>%
  group_by(CIPCODE) %>%
  summarise("2010"=sum(CTOTALT))
```


```{r}
math_completions_10 <- completions_split_10 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2010'=sum(CTOTALT))

math_completions_11 <- completions_split_11 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2011'=sum(CTOTALT))

math_completions_12 <- completions_split_12 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2012'=sum(CTOTALT))

math_completions_13 <- completions_split_13 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2013'=sum(CTOTALT))

math_completions_14 <- completions_split_14 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2014'=sum(CTOTALT))

math_completions_15 <- completions_split_15 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2015'=sum(CTOTALT))

math_completions_16 <- completions_split_16 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2016'=sum(CTOTALT))

math_completions_17 <- completions_split_17 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2017'=sum(CTOTALT))

math_completions_18 <- completions_split_18 %>%
  filter(coarsecip == '27') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2018'=sum(CTOTALT))
```


Let's take a peep: 

```{r}
math_completions_18
```



Now we have all the data for all the subfields for the years 2010-2018 stored in nine separate tables. Again we need to merge them 
```{r}
math_completions <- math_completions_10 %>%
  merge(math_completions_11,how='full') %>%
  merge(math_completions_12,how='full') %>%
  merge(math_completions_13,how='full') %>%
  merge(math_completions_14,how='full') %>%
  merge(math_completions_15,how='full') %>%
  merge(math_completions_16,how='full') %>%
  merge(math_completions_17,how='full') %>%
  merge(math_completions_18,how='full')
head(math_completions)
```
Now we're cooking! We ust need to tidy up the data. 

Let's add in a column of all the subfield names, then tidy along the year axis. 

```{r}
math_completions <- math_completions %>%
  merge(subfields_df)
math_completions <- math_completions %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
math_completions
```



Before we visualize, it will be handy to have abbreviated names for each subfield. 

```{r}
subfields_df
```


```{r}
abbr_math_subfield <- c('math',
                        'algebra',
                        'analysis',
                        'topology',
                        'mathother',
                        'applied',
                        'compmath',
                        'compmath+appl',
                        'finmath',
                        'mathbio',
                        'applmathoth',
                        'stat',
                        'mathstatprob',
                        'mathstat',
                        'statother',
                        'mathstatother')
```

Let's update the subfields list with these abbreviated names:

```{r}
subfields_df$abbrev <- abbr_math_subfield
```


In preparing this list, it jumps out at you how there are actually fewer real subfields: math, aaplied/computational math, stats. 

Lets write a crude-subfield functions

```{r}
crude_subfield <- function(t) {
  if (t %in%
      c('applied',
        'compmath',
        'compmath+appl',
        'finmath',
        'mathbio',
        'applmathoth')){
    'applied'
  }
  else if (t %in%
      c('math',
        'algebra',
        'analysis',
        'topology',
        'mathother')){
    'math'
  }
  else if (t %in% 
      c('stat',
        'mathstatprob',
        'mathstat',
        'statother',
        'mathstatother')){
    'stats'
  }
  }
```
Let's test our function

```{r}
crude_subfield('analysis')
crude_subfield('statother')
crude_subfield('mathbio')
```

Now lets add a crude subfield column to the subfields dataframe

```{r}

```
```{r}
subfields_df_dec <- subfields_df %>% 
  mutate(crude = map_chr(abbrev,crude_subfield))

```

```{r}
math_completions_dec <- math_completions %>%
  merge(subfields_df_dec, on = 'CIPCODE')
summarised_math_completions <- math_completions_dec %>%
  group_by(crude,year) %>%
  summarise('total_completions'=sum(completions))
```



Let's visualize. 



```{r}
math_subfield_viz <-  ggplot(data = summarised_math_completions, aes(strtoi(year)-2000, total_completions)) +
  geom_point(color="steelblue") + 
  labs(title = "Mathematics degrees by subfield (2010-2018)",
       y = "Count of degree completions", x = "year") + 
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank())+
  facet_wrap(~crude)
ggsave('math_subfield_viz.png',dpi=400)
math_subfield_viz
```
With the disclaimer that these are pretty crude bins, you can see strong growth across the board, with the applied fields (which includes computational math, applied math, and math bio) growing the fastest in recent years. 

Let's look at percent changes: 

```{r}
math_completions_dec
```


```{r}
net_math_completions <-  math_completions_dec %>% 
  filter(year == 2010 | year == 2018) %>%
  select(year,completions,abbrev,crude)
net_math_completions <- net_math_completions %>%
  spread(year,completions)
colnames(net_math_completions)=c('subfield','crude_subfield','math_comps_10','math_comps_18')
net_math_completions <- net_math_completions %>%
  mutate(percent_change = 100*(math_comps_18/(math_comps_10+1)-1), net_change = math_comps_18-math_comps_10) %>%
  arrange(desc(percent_change))
net_math_completions$percent_change
v <- net_math_completions$percent_change
v[1] <- 100
net_math_completions$percent_change <- v
net_math_completions
```
We can also do the crude versiomn of this. 

```{r}
net_math_crude_completions <- net_math_completions %>%
  group_by(crude_subfield) %>%
  summarise(total_completions_10 = sum(math_comps_10),
            total_completions_18 = sum(math_comps_18))

net_math_crude_completions <- net_math_crude_completions %>%
  mutate(net_change = total_completions_18-total_completions_10,
         pct_change = (total_completions_18-total_completions_10)/total_completions_10)
```


Let's visualize these growth rates: 
```{r}
pct_math_scatter <- ggplot(data=net_math_completions,aes(x=subfield,y=percent_change))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.ticks = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Math subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('math_fastest_viz.png',dpi=800)+
  coord_flip()
pct_math_scatter

pct_math_scatter <- ggplot(data=net_math_completions,aes(x=subfield,y=net_change))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.ticks = element_blank())+
  geom_hline(yintercept=0,color='red')+
  labs(title="Math subfield growth (net): 2010-2018",x='area',y='net change in degree completions')+
  ggsave('math_fastest_viz.png',dpi=800)+
  coord_flip()
pct_math_scatter
```



Let's look at how the distributions have shifted over the period. (I know that pie charts have been deprecated )

```{r}
net_math_completions
```

Let's do some tidying

```{r}
net_math_completions <- net_math_completions %>%
  rename('2010'='math_comps_10','2018'='math_comps_18')
net_math_completions
```


```{r}
tidy_math_completions <- net_math_completions %>%
  gather('2010','2018',key='year',value='completions')
tidy_math_completions
```


```{r}
bar_2010 <- ggplot(tidy_math_completions,aes(x=year,y=completions,fill=subfield))+
  geom_bar(position="dodge",
           stat="identity")+
  theme_bw()
bar_2010
```

Lets crudify it and switch to proportions.

```{r}
crude_bar_math <- ggplot(tidy_math_completions,aes(x=year,y=completions,fill=crude_subfield))+
  geom_bar(position="fill",
           stat="identity",
           normalize = TRUE)+
  theme_bw()+
  ylab('Proportion of all math degree completions')
ggsave("crude_bar_math.png",crude_bar_math,dpi=800)
crude_bar_math

```

This shows that stats and applied fields are really growing within the math major, although we have already seen that there is a considerable amount of growth in general mathematics degrees. Any company that is trying to sell books and software to these students should look into publishing more applied and statistics titles. 



======== UNFINISHED ANALYSIS OF OF COMPLETIONS DATA FOR COMPUTER SCIENCE ====


Below we sketch how to begin carrying out the same analysis for the Computer Science category (CoarseCIP: 11). The interested reader can update some of the commands and finish it. 

```{r}
subfields_names <- c("Gen CS","AI","IT","Informatics","CS (Other)",
                     "Gen program","Spec. program",
                     "Cert program", "Other program",
                     "Data processing","Info sci",
                     "Comp sys analyst","Data entry",
                     "Word proc","Other data entry",
                     "CS","Web page design","Data model",
                     "Graphics","Modelling","Software",
                     "Networks/Telecom","Network admin",
                     "LAN admin","Security","Webmaster",
                     "IT Project","Comp supp","IT Admin",
                     "Other CS Supp")
subfields_codes <- c(11.0101, 11.0102, 11.0103, 11.0104,
                     11.0199, 11.0201, 11.0202, 11.0203,
                     11.0299, 11.0301, 11.0401, 11.0501,
                     11.0601, 11.0602, 11.0699, 11.0701,
                     11.0801, 11.0802, 11.0803, 11.0804,
                     11.0899, 11.0901, 11.1001, 11.1002,
                     11.1003, 11.1004, 11.1005, 11.1006,
                     11.1099, 11.9999)
subfields_codes <- as.character(subfields_codes)
subfields_df <- data.frame("subfield"=subfields_names,"CIPCODE"=subfields_codes)
subfields_df

cs_completions_10 <- df_completions_10 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2010'=sum(CTOTALT))

cs_completions_11 <- df_completions_11 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2011'=sum(CTOTALT))

cs_completions_12 <- df_completions_12 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2012'=sum(CTOTALT))

cs_completions_13 <- df_completions_13 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2013'=sum(CTOTALT))

cs_completions_14 <- df_completions_14 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2014'=sum(CTOTALT))

cs_completions_15 <- df_completions_15 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2015'=sum(CTOTALT))

cs_completions_16 <- df_completions_16 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2016'=sum(CTOTALT))

cs_completions_17 <- df_completions_17 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2017'=sum(CTOTALT))

cs_completions_18 <- df_completions_18 %>%
  mutate(spltcip = substr(CIPCODE,1,2)) %>%
  filter(spltcip == '11') %>%
  select(CIPCODE, CTOTALT)%>%
  group_by(CIPCODE) %>%
  summarize('2018'=sum(CTOTALT))
```

```{r}
cs_completions <- cs_completions_10 %>%
  merge(cs_completions_11,how='full') %>%
  merge(cs_completions_12,how='full') %>%
  merge(cs_completions_13,how='full') %>%
  merge(cs_completions_14,how='full') %>%
  merge(cs_completions_15,how='full') %>%
  merge(cs_completions_16,how='full') %>%
  merge(cs_completions_17,how='full') %>%
  merge(cs_completions_18,how='full')


cs_completions <- cs_completions %>%
  merge(subfields_df) %>%
  select(-CIPCODE)

cs_completions <- cs_completions %>%
  gather("2010","2011","2012","2013","2014","2015","2016","2017","2018",key="year",value="completions")
cs_completions
```
Finally we can visualize: 


```{r}
cs_subfield_viz <-  ggplot(data = cs_completions, aes(year, completions,group=1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color="steelblue") + 
  labs(title = "CS degrees by subfield (2010-2018)",
       y = "Count of degree completions", x = "year") + 
  facet_wrap(~subfield)
ggsave('discipline_viz.png',dpi=400)
cs_subfield_viz
```

```{r}
net_cs_completions <-  cs_completions %>% 
  filter(year == 2010 | year == 2018) %>%
  spread(year, completions)
colnames(net_cs_completions) <- c("subfield","cs_comps_2010","cs_comps_2018")
net_cs_completions <- net_cs_completions %>%
  mutate(percent_change = 100*(cs_comps_2018/cs_comps_2010-1), net_change = cs_comps_2018-cs_comps_2010) %>%
  arrange(desc(percent_change))
net_cs_completions$percent_change
v <- net_cs_completions$percent_change
v[1] <- 100
net_cs_completions$percent_change <- v
net_cs_completions
```
Let's visualize these growth rates: 
```{r}
pct_cs_scatter <- ggplot(data=net_cs_completions,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="CS subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_overall_pct_viz.png',dpi=400)+
  coord_flip()
pct_cs_scatter

pct_cs_scatter <- ggplot(data=net_cs_completions,aes(x=subfield,y=net_change))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  geom_label(aes(label=floor(net_change)))+
  labs(title="CS subfield growth (net): 2010-2018",x='area',y='net change in degree completions')+
  ggsave('cs_overall_net_viz.png',dpi=400)+
  coord_flip()
pct_cs_scatter
```

Let's filter this down a bit

```{r}
(best_net_cs <- net_cs_completions %>%
  arrange(desc(net_change)) %>%
  slice(1:10))

(best_pct_cs <- net_cs_completions %>%
  arrange(desc(percent_change)) %>%
  slice(1:10))
```


```{r}
best_net_cs_scatter <- ggplot(data=best_net_cs,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="Best CS subfield growth (by net): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_fastest_net_viz.png',dpi=400)+
  coord_flip()
best_net_cs_scatter

best_pct_cs_scatter <- ggplot(data=best_pct_cs,aes(x=subfield,y=percent_change))+
  geom_text(aes(label=subfield))+
  theme(axis.text.x = element_text(angle = 90))+
  geom_hline(yintercept=0,color='red')+
  labs(title="Best CS subfield growth (by pct): 2010-2018",x='area',y='pct change in degree completions')+
  ggsave('cs_fastest_pct_viz.png',dpi=400)+
  coord_flip()
best_pct_cs_scatter
```

It's worth noting what the different codes are: 

Data modeling includes Data Warehousing and Database Admin

Modelling includs Virtual Environments and Simulation

Many informatics degrees are budding into data science disciplines (e.g. at the University of Washington). 

